Configurable processor with main controller to increase activity of at least one of a plurality of processing units having local program counters

ABSTRACT

A processor comprises a main controller (CTR11) and a plurality of processing units (1–9). Each processing unit (1–9) has a local controller (CTR1–CTR9) and at least one functional unit (FU1–FU9) controllable by the local controller (CTR1–CTR9). The local controller (CTR1–CTR9) of a processing unit (1–9) is coupled (15) to the main controller (CTR11). The processor further comprises an instruction set, having at least one instruction for increasing the activity of at least one processing unit (1–9). The main controller (CTR11) is arranged to process the at least one instruction for increasing the activity of at least one processing unit (1–9). One or more processing units (1–9) of the processor can be completely switched off, including the corresponding local controller (CTR1–CTR9), since the instructions for switching on a processing unit (1–9) are not processed by the corresponding local controller (CTR1–CTR9), but by the main controller (CTR11) itself.

TECHNICAL FIELD

The present invention relates to a processor comprising a main controller; a plurality of processing units, each processing unit comprising a local controller and at least one functional unit controllable by the local controller, the local controller being coupled to the main controller; and an instruction set having at least one instruction for increasing the activity of at least one processing unit.

BACKGROUND

Concurrent processing allows increasing the performance of a processor, and requires some form of parallelism to be introduced in the processor architecture. A processor can exploit two forms of parallelism. The first is instruction-level parallelism, in which more than one instruction at a time is executed within one task. The second concerns task-level parallelism, in which multiple tasks are executed simultaneously by the processor. The application that has to be executed determines the amount of instruction-level parallelism and task-level parallelism that can be maximally exploited.

Configurable processors are pre-fabricated devices that can be customized to perform a specific function. An example of a configurable processor is a scalable VLIW (Very Large Instruction Word) processor, i.e. a VLIW processor with a large number of functional units. A VLIW processor allows exploiting instruction-level parallelism in programs and thus executing more than one instruction at a time. Multiple, independent functional units are used to execute multiple operations in parallel. VLIW processors carry out multiple functional unit operations in response to one very long instruction. Each VLIW effectively configures the data path of the processor for computations in space, i.e. in parallel.

The flexibility of a VLIW processor can be improved by allowing the processor to execute multiple tasks in parallel and thus exploiting task-level parallelism, if present in an application. In case of a traditional VLIW processor only a single task can be executed, due to the presence of only a single controller with a corresponding single program counter. VLIW processors with partitioned controllers, however, are capable of exploiting task-level parallelism, and this principle is described in "Architecture and implementation of a VLIW supercomputer", Colwell R. et al., Proc. of Supercomputing '90, New York, N.Y., USA, 12–16 Nov. 1990. Each controller controls a segment of the processor and in principle two operation modes are possible. In the first mode the controllers operate independently, while in the second mode all controllers are locked together. In the first mode the net effect is that of having a multi-processor system, which allows multiple tasks to be executed simultaneously and thus exploits task-level parallelism. In the second mode, a classical VLIW processor is obtained. It is possible to switch between both modes during computation.

A problem associated with the introduction of parallelism in a processor architecture, among others, is related to the increase in number of functional units and the corresponding increase in communication overhead, as this may result in unnecessary power dissipation if these resources cannot be fully used at a given moment in time. For example, in case of a scalable VLIW processor with partitioned controllers, functional units will remain unused if insufficient instruction-level parallelism or task-level parallelism is present in a specific application. These functional units may still consume a significant amount of power.

The VLIW processor of U.S. Pat. No. 6,219,796 has processing units that have been made responsive to a dedicated instruction, e.g. a SLEEP instruction, which at least partially powers down the associated execution unit. The execution units are made active again either by another dedicated instruction, e.g. a WAKE instruction, or by the receipt of an active, i.e. a non-SLEEP instruction. Consequently, the active configuration of the processor can be altered by dedicated instructions present in the instruction flow of VLIWs, resulting in a reduction of the power consumption by the active processor. The dedicated instructions are inserted into a VLIW by the compiler. This is realized by first detecting a segment of inactive instructions, e.g. NOPs, for a given functional unit and, subsequently, replacing the first inactive instruction in the segment by, for example, a SLEEP instruction, and replacing the last inactive instruction in the segment with a WAKE instruction.

It is a disadvantage of the prior art processor that a processing unit cannot be completely switched off. Some control logic will have to remain powered in order to be able to process the instruction for making the processing unit active again.

DISCLOSURE OF INVENTION

An object of the invention is to provide a processor architecture that allows completely switching off one or more processing units. This object is achieved with a processor of the kind set forth, characterized in that the main controller is arranged to process the at least one instruction for increasing the activity of at least one processing unit.

One or more processing units of the processor can be completely switched off, including the corresponding local controller, since the instructions for switching on a processing unit are not processed by the corresponding local controller, but by the main controller itself. Leakage currents are avoided in a completely switched off processing unit. During computation of an application, processing units can be switched off and subsequently switched on by the main controller, depending on the amount of instruction-level parallelism and task-level parallelism present in the application at a given moment in time. Furthermore, performance can be traded for reduction in power consumption. In case of computing an application requiring low power consumption, this can be achieved by completely switching off one or more processing units, and thus reducing the power consumption, at the expense of performance.

An embodiment of the invention is characterized in that the instruction set further has at least one instruction for reducing the activity of at least one processing unit, and the main controller is arranged to process the at least one instruction for reducing the activity of at least one processing unit. An advantage of this embodiment is that by executing one instruction the activity of multiple processing units can be reduced simultaneously. Furthermore, the complexity of the local controllers is reduced.

An embodiment of the invention is characterized in that the instruction for decreasing the activity of at least one processing unit is an instruction for completely switching off the processing unit, and the instruction for increasing the activity of at least one processing unit is an instruction for completely switching on the processing unit. In case of a processor with a large number of processing units, some of the processing units may remain unused if insufficient instruction-level parallelism or task-level parallelism is present in a specific application. These processing units may still consume a considerable amount of power and by completely switching off one or more processing units, unnecessary power consumption is prevented. An advantage of this embodiment is that the operation to switch off or on one or more processing units can be implemented in the hardware of the main controller, reducing the VLIW size and decreasing the response time for switching off and on a processing unit.

An embodiment of the invention is characterized in that the processing unit further comprises a local instruction memory. An advantage of this embodiment is that, in case a processing unit is switched off, the corresponding local instruction memory can be switched off as well.

An embodiment of the invention is characterized in that the processing unit further comprises a local program counter. The local program counter allows a processing unit to operate independently of other processing units.

An embodiment of the invention is characterized in that at least one processing unit further comprises a register file, the register file being accessible by the functional unit. A register file is used for storing input data of a functional unit and allows fast access to these data, increasing the performance of a processor.

An embodiment of the invention is characterized in that the register file is a distributed register file. An advantage of a distributed register file is that it requires fewer read and write ports per register file segment, resulting in a smaller register file bandwidth. Furthermore, it improves the scalability of the processor, when compared to a central register file.

An embodiment of the invention is characterized in that the processor further comprises a communication network for coupling the functional units of the processing units and the register files of said processing units. A communication network allows directly passing an output value from a functional unit to a register file, increasing the performance of the processor.

An embodiment of the invention is characterized in that the communication network is a partially connected communication network, i.e. not each functional unit of the processing units and each register file of said processing units are coupled. The use of a partially connected network reduces the code size, due to fewer addressable registers, as well as the power consumption. Furthermore, it improves the scalability of the processor when compared to a fully connected network.

An embodiment of the invention is characterized in that the processor is a configurable processor. An advantage of a configurable processor is its flexibility, since the processor can be customized to compute on a certain domain of applications, instead of only on a specific application.

BRIEF DESCRIPTION OF THE DRAWINGS

The features of the described embodiments will be further elucidated and described with reference to the drawing:

The single FIGURE is a schematic diagram of a VLIW processor in accordance with an embodiment of the present invention.

DESCRIPTION OF PREFERRED EMBODIMENTS

Referring to the FIGURE, a schematic block diagram illustrates a VLIW processor, comprising a plurality of processing units 1–9 and a main control unit 11. Each processing unit 1–7 has a corresponding functional unit FU1–FU7, a corresponding register file RF1–RF7, a corresponding local controller CTR1–CTR7, a corresponding local program counter PC1–PC7 and a corresponding local instruction memory IM1–IM7. Each register file RF1–RF7 is accessible by its corresponding functional unit FU1–FU7. Processing unit 9 has a functional unit FU9, a controller CTR9, a local program counter PC9 and a local instruction memory IM9. The main control unit 11 has a main controller CTR11, a program counter PC11 and an instruction memory IM11. The processor further comprises a communication network 13 for coupling the functional units FU1–FU9 and the register files RF1–RF7, allowing values to be passed from the output of the functional units FU1–FU9 to the register files RF1–RF7. The main controller CTR11 is coupled to the local controllers CTR1–CTR9 via a connection 15. The local controllers CTR1–CTR9 are coupled to each other via connections 17–23.

Each local controller CTR1–CTR9 controls its corresponding processing unit 1–9. The main controller CTR11 controls the main control unit 11. The compiler partitions the VLIW issued to the processor into six segments, and each segment is issued to one of the instruction memories IM1–IM11. Each local instruction memory IM1–IM9 holds the instructions to be processed by the corresponding local controller CTR1–CTR9. The local program counter PC1–PC9 refers to the address of the corresponding local instruction memory IM1–IM9, where the next instruction to be processed is stored. The local controllers CTR1–CTR9 fetch the next instruction to be executed from the corresponding local instruction memory IM1–IM9 and decode the instruction. Subsequently, the resulting operation code and register addresses are issued to the corresponding functional unit and register file, respectively. The instructions to be processed by the main control unit 11 are stored in the instruction memory IM11. The program counter PC11 refers to the address of the instruction memory IM11, where the next instruction to be processed by the main controller CTR11 is stored. The main controller CTR11 fetches the next instruction to be executed from the instruction memory IM11, decodes the instruction and performs the operations.

In a first mode of operation, the local controllers CTR1–CTR9 run in a so-called lock-step mode. This mode is preferably achieved by making all local program counters PC1–PC9 point to the same logical address of the corresponding local instruction memory IM1–IM9. The actual local program counter PC1–PC9 is determined by one of the local controllers CTR1–CTR9, and passed to the other local controllers, via the connections 17–23. Since instructions are fetched from the same logical address of the local instruction memory IM1–IM9, the processor effectively executes one VLIW in only one task and therefore behaves as a classical VLIW processor. In a second mode of operation each local controller CTR1–CTR9 works independently, determining its own value for the local program counter used for fetching the instructions. In this mode of operation a multi-processor system exists, capable of computing five independent tasks in parallel. In different modes of operation, combinations of the first and second mode of operation are possible. For example, local controllers CTR1 and CTR3 can run in lock-step mode, executing their own task and capable of exploiting a limited degree of instruction-level parallelism. The local controllers CTR5–CTR9 can run in lock-step mode as well, also executing their own task. In this mode, the processor executes two independent tasks and within each task a limited degree of instruction-level parallelism can be exploited, as compared to the first mode of operation where all local controllers CTR1–CTR9 run in lock-step mode.
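The modes of operation described above can be illustrated with a minimal Python sketch. It is an illustration only, not part of the disclosed hardware: the function `advance`, the grouping of controllers into lists, the convention that the first member of a group determines the program counter, and the `next_pc` callback are all assumptions introduced here.

```python
def advance(pcs, groups, next_pc):
    """Advance all local program counters by one step.

    pcs     -- dict mapping a controller name to its current program counter
    groups  -- list of lock-step groups; the first member of each group is
               assumed to determine the program counter for the whole group
    next_pc -- hypothetical callback (controller, pc) -> next pc
    """
    new_pcs = dict(pcs)
    for group in groups:
        leader = group[0]
        target = next_pc(leader, pcs[leader])
        # The leader's value is passed to the other controllers of the group.
        for controller in group:
            new_pcs[controller] = target
    return new_pcs


controllers = ("CTR1", "CTR3", "CTR5", "CTR7", "CTR9")
pcs = {c: 0 for c in controllers}

# Mixed mode from the text: CTR1 and CTR3 form one task, CTR5-CTR9 another.
mixed = [["CTR1", "CTR3"], ["CTR5", "CTR7", "CTR9"]]
# Assumed behaviour: the first task steps sequentially, the second one jumps.
result = advance(pcs, mixed, lambda c, pc: pc + 1 if c == "CTR1" else pc + 4)
```

With a single group containing all controllers the sketch models the first (classical VLIW) mode; with one group per controller it models the second (multi-processor) mode.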

The main controller processes instructions for reducing the activity as well as instructions for increasing the activity of one or more processing units. The instructions held in instruction memory IM11 contain several fields, indicating the type of operation that has to be performed by the main controller CTR11, and for which local controller this operation holds. For example, in case the activity of processing units 1 and 3 has to be reduced, an operation called “SLEEP” can be used. This can be realized with an instruction, issued to instruction memory IM11, which contains a field “SLEEP” and other bit fields indicating that this instruction only holds for processing units 1 and 3. Alternatively, the operation can be encoded in a pre-determined field of the instruction, indicating for which processing unit the operation holds. The main controller CTR11 reads the instruction from instruction memory IM11, decodes the instruction and subsequently reduces the activity of the processing units 1 and 3. At a later moment in time, the activity of the processing units 1 and 3 can be increased, by an operation called “WAKE”. That instruction, issued to instruction memory IM11, contains a field “WAKE” and other bit fields indicating that this instruction only holds for processing units 1 and 3. The main controller CTR11 reads the instruction from instruction memory IM11, decodes the instruction and subsequently increases the activity of processing units 1 and 3.
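One possible way to picture such an instruction is sketched below in Python. The encoding is purely hypothetical: the text does not specify field widths or positions, so the opcode values, the five-bit unit mask, the assumption that the processing units are numbered 1, 3, 5, 7 and 9, and the names `Op`, `encode`, `decode` and `MainController` are all introduced here for illustration.

```python
from enum import IntEnum

# Assumed numbering of the five processing units shown in the FIGURE.
UNIT_IDS = (1, 3, 5, 7, 9)
MASK_BITS = len(UNIT_IDS)


class Op(IntEnum):
    """Hypothetical opcode values for the operation field."""
    NOP = 0
    SLEEP = 1
    WAKE = 2


def encode(op, units):
    """Pack an operation field and a per-unit bit mask into one word."""
    mask = 0
    for unit in units:
        mask |= 1 << UNIT_IDS.index(unit)
    return (int(op) << MASK_BITS) | mask


def decode(word):
    """Split a word into its operation and the list of addressed units."""
    mask = word & ((1 << MASK_BITS) - 1)
    op = Op(word >> MASK_BITS)
    units = [u for i, u in enumerate(UNIT_IDS) if mask & (1 << i)]
    return op, units


class MainController:
    """Toy model of CTR11 tracking which processing units are active."""

    def __init__(self):
        self.active = {u: True for u in UNIT_IDS}

    def execute(self, word):
        op, units = decode(word)
        for unit in units:
            if op is Op.SLEEP:
                self.active[unit] = False
            elif op is Op.WAKE:
                self.active[unit] = True


ctr11 = MainController()
ctr11.execute(encode(Op.SLEEP, [1, 3]))  # units 1 and 3 reduced in activity
state_after_sleep = dict(ctr11.active)
ctr11.execute(encode(Op.WAKE, [1, 3]))   # units 1 and 3 active again
```

Because the wake-up word is decoded by the model of CTR11 rather than by the addressed units, nothing in a sleeping unit needs to stay powered in this sketch, which mirrors the reasoning of the text.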

The compiler detects if a processing unit will have a segment of inactive instructions. This may be the case if the given application does not have sufficient instruction-level parallelism or task-level parallelism to schedule active instructions for all processing units. At the beginning of the segment of inactive instructions, an instruction is added to the VLIW to reduce the activity of the corresponding processing unit, and the main control unit 11 processes the latter instruction. At the end of the segment of inactive instructions of a processing unit, an instruction is added to the VLIW to increase the activity of the corresponding processing unit and the latter instruction is processed by the main control unit 11. This instruction may also be added before the end of the segment of inactive instructions in order to allow the processing unit to have more time for increasing its activity, before active instructions are scheduled for that processing unit. The instruction issued to the main control unit 11 may also cause two or more processing units simultaneously to reduce their activity. Subsequently, the instructions causing these processing units to increase their activity may be executed at different points in time, depending on the size of their segments of inactive instructions.
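The compiler step described above can be sketched as a small Python pass. This is a simplified illustration under stated assumptions rather than the actual compiler: the schedule representation (one list of operation names per processing unit, with "NOP" marking an inactive slot), the function name `plan_power_ops`, the `min_run` threshold below which a segment is ignored, and the `wake_lead` parameter for waking a unit early are all hypothetical.

```python
def plan_power_ops(schedule, min_run=3, wake_lead=0):
    """Plan SLEEP/WAKE operations for the main control unit.

    schedule  -- dict mapping a unit number to its list of per-cycle operations
    min_run   -- assumed minimum length of an inactive segment worth sleeping
    wake_lead -- number of cycles before the end of the segment at which the
                 WAKE operation is issued
    Returns a dict mapping a cycle to a list of (operation, unit) pairs.
    """
    plan = {}
    for unit, ops in schedule.items():
        start = None
        # A sentinel closes a segment that runs to the end of the schedule.
        for cycle, op in enumerate(list(ops) + ["END"]):
            if op == "NOP":
                if start is None:
                    start = cycle
            elif start is not None:
                last_inactive = cycle - 1
                if cycle - start >= min_run:
                    # Never wake in the same cycle in which the unit sleeps.
                    wake_at = max(start + 1, last_inactive - wake_lead)
                    plan.setdefault(start, []).append(("SLEEP", unit))
                    plan.setdefault(wake_at, []).append(("WAKE", unit))
                start = None
    return plan


schedule = {
    1: ["ADD", "NOP", "NOP", "NOP", "NOP", "MUL"],
    3: ["NOP", "NOP", "SUB", "NOP", "NOP", "NOP"],
}
plan = plan_power_ops(schedule, min_run=3, wake_lead=1)
```

In this example the two-cycle inactive segment of unit 3 is left alone because it is shorter than `min_run`, and each WAKE is placed one cycle before the end of its segment to give the unit time to become active.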

In a preferred embodiment the “SLEEP” operation completely switches off one or more processing units 1–9, while the operation “WAKE” completely switches on one or more processing units 1–9. An advantage of this embodiment is that it allows achieving the highest reduction in power consumption, as also the corresponding local controller, the corresponding register file and the corresponding local instruction memory of a processing unit are powered down. Processing units can be completely switched off, since the instructions for switching on a processing unit do not have to be processed by the corresponding local controller, but by the main controller CTR11. Leakage currents in a processing unit are avoided as a result of the complete switching off. It is a further advantage of this embodiment that it allows trading performance for reduction in power consumption. In case of computing an application requiring low power consumption, this can be achieved by completely switching off one or more processing units, and thus reducing the power consumption, at the expense of performance. In other embodiments, the “SLEEP” operation will only switch off selected parts of one or more processing units 1–9 and the “WAKE” operation will only switch on selected parts of one or more processing units 1–9. For example, a “SLEEP” operation may only switch off the functional unit as well as the corresponding local controller of one or more processing units. An advantage of this embodiment is that, at the moment a “WAKE” operation is applied, the response time to become ready for use is shorter when compared to a completely switched off processing unit. Furthermore, the local instruction memory can hold its stored instructions, even if it is a volatile type of memory. In another embodiment the local instruction memory may be of a non-volatile type, such as ROM, and this makes it possible to also switch off the local instruction memory without losing data.

In different embodiments, the instructions processed by the main controller CTR11 may only contain bit fields, for controlling which of the processing units 1–9 should be switched off or switched on. The operations that have to be performed are fixed and implemented in the hardware of the main controller CTR11. An advantage of this embodiment is that it reduces the VLIW size and decreases the response time for switching off and on a processing unit.

In other embodiments, several types of instructions can be defined in one instruction set to place processing units into different states of reduced activity. For example, two power down instructions can be defined to reduce the activity of a fully active processing unit: one placing the processing unit in a state in which it is partially deactivated and a second one completely switching off the processing unit. A fully active processing unit can be switched off in one step by means of the second power down instruction or in two consecutive steps by means of the first power down instruction followed by the second power down instruction. Furthermore, two power up instructions can be defined to increase the activity of a processing unit, when completely switched off: one placing the processing unit in a state in which it is partially activated, and a second one causing the processing unit to be fully active. The powering up of a completely switched off processing unit can be done in one step by means of the second power up instruction or in two consecutive steps by means of the first power up instruction followed by the second power up instruction. In this embodiment one or more processing units can be placed in a state in which they consume more power when compared to a complete switch off, but have a shorter response time to become ready for use again. An advantage of this embodiment is that, by placing the processing unit in a state in which it is partially activated, the delay time necessary for restoring complete activity is reduced.
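The two power down and two power up instructions can be read as transitions of a small state machine, sketched below in Python. The state names, the instruction mnemonics `DOWN1`, `DOWN2`, `UP1` and `UP2`, and the choice to leave the state unchanged for combinations the text does not describe are assumptions made for this illustration.

```python
from enum import Enum


class State(Enum):
    OFF = 0      # completely switched off
    PARTIAL = 1  # partially deactivated / partially activated
    ACTIVE = 2   # fully active


# Hypothetical mnemonics: DOWN1/UP1 are the "first" instructions of the text,
# DOWN2/UP2 the "second" ones.
TRANSITIONS = {
    ("DOWN1", State.ACTIVE): State.PARTIAL,
    ("DOWN2", State.ACTIVE): State.OFF,
    ("DOWN2", State.PARTIAL): State.OFF,
    ("UP1", State.OFF): State.PARTIAL,
    ("UP2", State.OFF): State.ACTIVE,
    ("UP2", State.PARTIAL): State.ACTIVE,
}


def step(state, instruction):
    """Return the next state; undefined combinations keep the current state
    (an assumption, since the text does not cover them)."""
    return TRANSITIONS.get((instruction, state), state)


def run(state, instructions):
    """Apply a sequence of power instructions to a processing unit state."""
    for instruction in instructions:
        state = step(state, instruction)
    return state


one_step_off = run(State.ACTIVE, ["DOWN2"])
two_step_off = run(State.ACTIVE, ["DOWN1", "DOWN2"])
two_step_on = run(State.OFF, ["UP1", "UP2"])
```

Both the one-step and the two-step sequences end in the same state; the intermediate `PARTIAL` state is what gives the shorter response time mentioned in the text.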

In other embodiments, the instruction for reducing the activity of one of the processing units 1–9 is issued directly to its corresponding local instruction memory IM1–IM9 and processed by its corresponding local controller CTR1–CTR9. Subsequently, the main controller may place the processing unit in a more active state, by processing the instruction for increasing the activity of one or more processing units. An advantage of this embodiment is that it reduces the complexity of the main controller CTR11 as well as the communication overhead between the main controller CTR11 and the local controllers CTR1–CTR9. A local controller can completely switch off its corresponding processing unit, thus including itself, since the main controller CTR11 can always place the processing unit in a more active state.

In an advantageous embodiment, the processing units 1–9 have a corresponding register file RF1–RF7 for storing input data of a functional unit and allowing fast access to these data.

In a preferred embodiment, the register files RF1–RF7 are distributed register files, i.e. several register files, each for a limited set of functional units, are used instead of one central register file for all functional units FU1–FU9. An advantage of a distributed register file is that it requires fewer read and write ports per register file segment, resulting in a smaller register file bandwidth. Furthermore, it improves the scalability of the processor when compared to a central register file.

In an advantageous embodiment, a communication network 13 is present, which couples the functional units FU1–FU9 and the register files RF1–RF7. The communication network 13 allows directly passing an output value of one of the functional units FU1–FU9 to one of the register files RF1–RF7.

In a preferred embodiment, the communication network 13 is a partially connected communication network, i.e. not each functional unit FU1–FU9 is coupled to each register file RF1–RF7. The use of a partially connected communication network reduces the code size as well as the power consumption, and also allows increasing the performance of the processor. Furthermore, it improves the scalability of the processor when compared to a fully connected communication network.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim. The word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. In the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

1. A processor comprising: a main control unit including a main controller, a main program counter, and a main instruction memory, the main instruction memory addressed by a main program counter; a plurality of processing units, each processing unit comprising a local instruction memory addressed by a local program counter, a local controller and at least one functional unit controllable by the local controller, the local controller being coupled to the main controller; and an instruction set stored in the main instruction memory having at least one instruction for increasing the activity of at least one processing unit; characterized in that the main controller is arranged to receive an address within the main instruction memory from the main program counter, the main controller is further arranged to process the at least one instruction.
2. A processor according to claim 1 wherein: the instruction set further has at least one instruction for reducing the activity of at least one processing unit; the main controller is arranged to process the at least one instruction.
3. A processor according to claim 2 wherein: the instruction for decreasing the activity of at least one processing unit is an instruction for completely switching off the processing unit; the instruction for increasing the activity of at least one processing unit is an instruction for completely switching on the processing unit.
4. A processor according to claim 1 wherein: at least one processing unit further comprises a register file, the register file being accessible by the functional unit.
5. A processor according to claim 4 wherein: the register file is a distributed register file.
6. A processor according to claim 4 wherein: the processor further comprises a communication network for coupling the functional units of the processing units and the register files of said processing units.
7. A processor according to claim 6 wherein: the communication network is a partially connected communication network.
8. A processor according to claim 1 wherein: the processor is a configurable processor.